Build tent #2089
Code Review
This pull request enhances the build system and CUDA transport logic by introducing a local CUDA build script, making unit test compilation optional, and improving the NVLink transport layer with CUDA device context management and VMM allocation support. Feedback from the review focuses on improving script portability by removing hardcoded user paths, addressing a security vulnerability in LD_LIBRARY_PATH construction, and ensuring robust error handling for CUDA driver API calls.
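The device context management mentioned in the summary can be pictured as a scope guard that switches to the buffer's device and restores the caller's device on exit. The following is a minimal sketch, not the PR's implementation: `ScopedDevice`, `FakeRuntime`, and `RunOnDevice` are hypothetical names, and the get/set callbacks are injected so the example compiles without the CUDA toolkit. In real CUDA code they would typically wrap `cudaGetDevice` and `cudaSetDevice`.

```cpp
#include <functional>
#include <utility>

// Hypothetical RAII guard: activates `target` on construction and restores
// the previously active device on destruction.
class ScopedDevice {
public:
    ScopedDevice(int target, const std::function<int()>& get_device,
                 std::function<void(int)> set_device)
        : set_device_(std::move(set_device)),
          previous_(get_device()),
          switched_(previous_ != target) {
        if (switched_) set_device_(target);
    }

    ~ScopedDevice() {
        // Only switch back if the constructor actually changed the device.
        if (switched_) set_device_(previous_);
    }

    ScopedDevice(const ScopedDevice&) = delete;
    ScopedDevice& operator=(const ScopedDevice&) = delete;

private:
    std::function<void(int)> set_device_;
    int previous_;
    bool switched_;
};

// Stand-in for the runtime's "current device" state (assumption for the demo).
struct FakeRuntime {
    int current = 0;
};

// Returns the device that is active while the guard is alive.
int RunOnDevice(FakeRuntime& rt, int device) {
    ScopedDevice guard(device,
                       [&rt] { return rt.current; },
                       [&rt](int d) { rt.current = d; });
    return rt.current;  // device-specific work would go here
}
```

Because the restore happens in the destructor, early returns and error paths inside the guarded scope still leave the caller's device unchanged.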
    namespace {
    constexpr uint8_t kRedisMaxDbIndex = 255;
    constexpr uint8_t kRedisDefaultDbIndex = 0;
    }
Why add these constants instead of reusing REDIS_DEFAULT_DB_INDEX from elsewhere?
We are building without USE_REDIS?
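The exchange above comes down to keeping DB-index validation available when the Redis header is not compiled. A minimal sketch of that idea follows, using the two constants from the diff. The helper `NormalizeDbIndex` and its fall-back-to-default behavior are assumptions for illustration, not code from the PR.

```cpp
#include <cstdint>

namespace {
// Local replacements for the REDIS_*_DB_INDEX macros, so no Redis header
// is required. Values are the ones shown in the reviewed diff.
constexpr uint8_t kRedisMaxDbIndex = 255;
constexpr uint8_t kRedisDefaultDbIndex = 0;
}  // namespace

// Hypothetical helper: accept indices in [0, kRedisMaxDbIndex] and fall back
// to the default index for anything out of range.
constexpr uint8_t NormalizeDbIndex(int requested) {
    return (requested >= 0 && requested <= kRedisMaxDbIndex)
               ? static_cast<uint8_t>(requested)
               : kRedisDefaultDbIndex;
}

// Compile-time checks that the helper behaves as described.
static_assert(NormalizeDbIndex(3) == 3, "in-range index is kept");
static_assert(NormalizeDbIndex(256) == kRedisDefaultDbIndex,
              "out-of-range index falls back to the default");
```

Since the constants live in an anonymous namespace, they stay private to the translation unit and cannot collide with the macros when `USE_REDIS` is enabled.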
CI failing with
You can push an empty commit to trigger CI again
Codecov Report: ✅ All modified and coverable lines are covered by tests.
* Build with TENT
* Fix TENT failed start
* Revert
* Format
* Empty
Description
Fixes needed to build and run TENT without `USE_REDIS`, plus NVLink transport robustness fixes.

- `fabric_allocator.cmake`: Add `POST_BUILD` so the fabric allocator's custom build script runs after the target is built, not before.
- `tent/CMakeLists.txt`: Gate `tests/` behind `BUILD_UNIT_TESTS` so TENT can build with unit tests disabled.
- `tent/src/runtime/transfer_engine_impl.cpp`: Drop the `tent/metastore/redis.h` include and replace the `REDIS_*_DB_INDEX` macros with local `constexpr` constants. The header isn't compiled when `USE_REDIS=OFF`, but the DB-index validation is still wanted.
- `tent/src/transport/nvlink/nvlink_transport.cpp`:
  - […] (`cudaIpcGetMemHandle`, `cuMemGetAddressRange`) so they run on the device the buffer was allocated on, then restore the caller's device.
  - […] (`cuMemCreate`) pointers via `cuMemRetainAllocationHandle` and skip CUDA IPC export for them, since `cudaIpcGetMemHandle` only supports `cudaMalloc`-backed memory.
  - […] `cudaIpcGetMemHandle` fails, instead of just propagating the macro failure.

Module
- `mooncake-transfer-engine`
- `mooncake-store`
- `mooncake-ep`
- `mooncake-integration`
- `mooncake-p2p-store`
- `mooncake-wheel`
- `mooncake-pg`
- `mooncake-rl`

Type of Change
How Has This Been Tested?
Built TENT locally with `USE_REDIS=OFF`, `BUILD_UNIT_TESTS=OFF`, `USE_CUDA=ON`, `USE_MNNVL=ON` via `scripts/build_local_cuda_tent.sh`. Exercised the NVLink transport against PyTorch-allocated tensors (caching allocator sub-allocations) and against driver-allocated VMM buffers to confirm both paths are handled.

Checklist